Work Note:
- Due to computational time and system limitations, I have used a 15% random sample of the data in this exercise. I have checked the distribution and other descriptive statistics of the sample, and they match the original data.
- I have used the MinMax scaler because it preserves the shape of the original distribution, does not reduce the importance of outliers, and is the least disruptive to the information in the original data.
- K-fold cross-validation is limited to 5 folds due to computational time; using CV=10 would estimate the model's generalization performance better.
- Link to data: https://www.kaggle.com/edumagalhaes/quality-prediction-in-a-mining-process
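A minimal sketch of the sampling, scaling, and cross-validation workflow described above. The data here is synthetic stand-in noise (the real notebook loads the Kaggle mining-process CSV), and Ridge is just a placeholder estimator:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the mining-process data
X_full = rng.normal(size=(10_000, 5))
y_full = X_full @ rng.normal(size=5) + rng.normal(scale=0.1, size=10_000)

# Take a 15% random sample to keep training tractable
idx = rng.choice(len(X_full), size=int(0.15 * len(X_full)), replace=False)
X, y = X_full[idx], y_full[idx]

# MinMax scaling maps each feature to [0, 1] without changing its shape
X_scaled = MinMaxScaler().fit_transform(X)

# 5-fold CV (10 folds would estimate generalization better but costs more time)
scores = cross_val_score(Ridge(), X_scaled, y, cv=5, scoring="r2")
print(scores.mean())
```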

Context.

Content.

Explore data (EDA)


Removal of Unwanted variables.

Feature selection.


Model Building approach.

Lasso Basic Model

Ridge Basic Model

ElasticNet Basic Model

Decision Tree Basic Model

ExtraTree Regressor Basic Model

Random Forest Basic Model

Bagging Basic Model

AdaBoost Basic Model

Gradient Boost Basic Model

XGBoost Basic Model

CatBoostRegressor Basic Model

LightGBM Regressor Basic Model

Linear Regression Model

Checking the Linear Regression Assumptions

 1) No Multicollinearity
 2) Mean of residuals should be 0 
 3) No Heteroscedasticity
 4) Linearity of variables
 5) Normality of error terms

Checking Assumption 1: No Multicollinearity
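Multicollinearity is commonly checked with the variance inflation factor (VIF), where VIF_j = 1 / (1 - R²_j) from regressing feature j on the remaining features; values above roughly 5-10 flag a problem. A minimal sketch on synthetic features (not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing feature j on the others."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Third column is almost a copy of the first -> both get a very large VIF
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=500)])
print(vif(X))
```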

Checking Assumption 2: Mean of residuals should be 0
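When the model includes an intercept, OLS residuals average to zero by construction, so this check mostly confirms the fit was done correctly. A quick numeric sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(size=1_000)

model = LinearRegression().fit(X, y)  # fit_intercept=True by default
residuals = y - model.predict(X)
print(abs(residuals.mean()))  # ~0 up to floating-point error
```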

Checking Assumption 3: No Heteroscedasticity
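A crude numeric stand-in for the usual residuals-vs-fitted plot: regress the squared residuals on the fitted values and see whether anything is explained. Under homoscedasticity the R² stays near zero; when the error variance grows with the response it does not. This is a simplified sketch, not a formal Breusch-Pagan test:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def resid_vs_fitted_r2(X, y):
    """How much of the squared residuals do the fitted values explain?"""
    m = LinearRegression().fit(X, y)
    fitted = m.predict(X).reshape(-1, 1)
    resid_sq = (y - m.predict(X)) ** 2
    return LinearRegression().fit(fitted, resid_sq).score(fitted, resid_sq)

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=2_000)
X = x.reshape(-1, 1)

y_homo = 2 * x + rng.normal(size=x.size)        # constant error variance
y_hetero = 2 * x + x * rng.normal(size=x.size)  # variance grows with x

r2_homo = resid_vs_fitted_r2(X, y_homo)
r2_hetero = resid_vs_fitted_r2(X, y_hetero)
print(r2_homo, r2_hetero)
```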

Checking Assumption 4: Linearity of variables

Checking Assumption 5: Normality of error terms.
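Alongside a Q-Q plot, the Shapiro-Wilk test gives a quantitative normality check on the residuals: a small p-value rejects normality. A sketch on synthetic residuals (one normal set, one skewed set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500) - 1.0  # clearly non-normal

# Shapiro-Wilk: a p-value below the chosen alpha rejects normality
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
print(p_normal, p_skewed)
```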

Coefficient Analysis

Model Tuning

Note:
- The parameters above overfit the model. After a few trial runs, the parameters below reduce the overfitting and give comparatively low errors (residuals).
Note:
- I have changed max_depth to 18 to avoid overfitting. Next time we run the grid search we should not include max_depth=None, as it causes the model to overfit.
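The point above can be sketched as a grid search whose max_depth candidates are explicitly bounded instead of including None (which lets trees grow until pure). The estimator, data, and grid values here are illustrative, not the notebook's exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

# Bound max_depth explicitly; leaving None in the grid tends to overfit
grid = {"max_depth": [10, 14, 18]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    grid, cv=3, scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
```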
Note:
- Above you will not see max_depth because the search selected the default of None.
- Below I use n_estimators=200 to increase the model's performance score; I did not include n_estimators in the grid search because of the computation time.
Note:
- The grid search parameters above overfit the model; I have changed max_depth to 10, min_child_weight to 3, and subsample to 60% to avoid overfitting.
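The regularized settings above can be sketched as follows. To stay self-contained this uses sklearn's GradientBoostingRegressor rather than XGBoost, with min_samples_leaf standing in loosely for XGBoost's min_child_weight; the data is synthetic and the comparison of train/test R² is what illustrates the overfitting gap:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.normal(size=(1_000, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=1_000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# In the spirit of the note: capped depth and 60% row subsampling;
# min_samples_leaf plays a role similar to XGBoost's min_child_weight.
model = GradientBoostingRegressor(max_depth=10, subsample=0.6,
                                  min_samples_leaf=3, random_state=0)
model.fit(X_tr, y_tr)

test_r2 = r2_score(y_te, model.predict(X_te))
gap = r2_score(y_tr, model.predict(X_tr)) - test_r2
print(test_r2, gap)  # smaller gap means less overfitting
```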

HyperOpt Regressor

Gradient Boost HyperOpt

CatBoost HyperOpt

LGBM HyperOpt


Compare Model Performance.


Feature Importance from the Extra Tree Grid Search


Prediction Analysis on Test Data.


Insights

The End